Expanding Reinforcement Learning Approaches for Efficient Crawling the Web

Authors

  • Hamid Reza Motahari Nezhad
  • Ahmad Abdollahzadeh Barfourosh
Abstract

The amount of information accessible on the World Wide Web is increasing so rapidly that a general-purpose search engine cannot index everything on the Web. Focused crawlers have been proposed as a way to overcome this coverage problem by restricting the crawl to a limited topical domain. Focused crawling is a technique for crawling particular topical portions of the World Wide Web quickly and efficiently by following only the most promising links, without having to explore all Web pages. Much research [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] has been done to improve the functionality and effectiveness of focused crawlers. The major problem of focused crawling is how to identify the pages that are relevant to the focused topic(s) of the search. An approach to focused crawling that uses reinforcement learning to spider the Web efficiently [4,5,6], used in the Cora search engine, is three times more efficient than breadth-first crawlers, which are commonly used by general-purpose search engines, and also outperforms other focused crawling methods [5]. The most important feature of reinforcement learning that makes it appropriate for focused crawling of the Web is its ability to model the future reward of following a hyperlink: it can estimate the amount of reward (target pages) that an agent can earn in the future by following that hyperlink. The main contribution of this paper is an approach for extending the crawling methods of the Cora spider. The proposed approach has been designed and implemented, and its results have been compared with the Cora spiders, a breadth-first spider, and a well-known focused crawler approach [1]. We introduce novel methods for calculating the Q-value in a reinforcement learning spider. Our crawlers find the target pages faster and earn more reward over the crawl than Cora's crawlers and the other existing crawlers.
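The notion of a hyperlink's future reward can be illustrated with a minimal sketch. It is not the Q-value calculation proposed in this paper, which the abstract does not specify. It assumes that the link graph and the set of target pages are fully known (as in an offline copy of a site) and that each target page reachable through a link contributes gamma raised to its shortest hop distance. The names `link_q_value`, `site_graph`, and `target_pages`, and all page names, are hypothetical.

```python
from collections import deque


def link_q_value(graph, targets, start, gamma=0.5):
    """Discounted number of target pages reachable after following a link to `start`.

    Assumption for illustration: a target page that is d further hops away
    from `start` (d = 0 for `start` itself) contributes gamma ** d.
    """
    seen = {start}
    queue = deque([(start, 0)])
    q_value = 0.0
    while queue:
        page, depth = queue.popleft()
        if page in targets:
            q_value += gamma ** depth
        # Breadth-first expansion, so each page is reached at its shortest distance.
        for next_page in graph.get(page, ()):
            if next_page not in seen:
                seen.add(next_page)
                queue.append((next_page, depth + 1))
    return q_value


# Hypothetical department site: research papers are the target pages.
site_graph = {
    "home": ["people", "research"],
    "people": ["alice"],
    "alice": ["paper3.ps"],
    "research": ["paper1.ps", "paper2.ps"],
}
target_pages = {"paper1.ps", "paper2.ps", "paper3.ps"}

q_research = link_q_value(site_graph, target_pages, "research", gamma=0.5)  # 1.0
q_people = link_q_value(site_graph, target_pages, "people", gamma=0.5)      # 0.25
```

Under these assumptions the link to `research` scores higher than the link to `people`, so a crawler that follows the highest-valued link first reaches two target pages sooner. A smaller gamma makes the estimate favor immediate rewards more strongly.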
Furthermore, we use a Support Vector Machine (SVM) classifier for the first time as the text learner in Web crawlers and compare the results with crawlers that use a Naïve Bayes (NB) classifier for this purpose. The results show that crawlers using SVMs outperform crawlers using NB in the first half of crawling a Web site and find the target pages more quickly. We have studied how different parameters and information from the search context, such as the gamma value in the Q-value equation, the neighborhood text of hyperlinks, and the number of classes in the classifier, affect the efficiency and effectiveness of crawlers. The test bed for evaluating our approaches was the Web sites of the computer science departments of four universities, which have been made available offline. Figure 1 shows that two reinforcement learning crawlers that use an SVM classifier outperform the crawlers that use NB in the first 30% of the crawl; they find target pages more rapidly, which increases the amount of received reward. Figure 2 shows that two reinforcement learning crawlers that use an SVM classifier outperform the crawlers that use NB over the whole course of the crawl and consistently have better effectiveness.

Figure 1: Comparison of reinforcement learning spiders using SVM and NB classifiers in the first 30 percent of the crawl for two of the implemented crawling methods.
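Comparisons such as "the first 30% of the crawl" can be computed from the order in which a crawler visits pages. The following is a sketch of one plausible way to do so, not the evaluation code used in the paper. The function name `fraction_found_at` and both crawl orders are hypothetical and do not reproduce any reported result.

```python
def fraction_found_at(crawl_order, targets, crawl_percent):
    """Fraction of all target pages found within the first `crawl_percent` of the crawl."""
    # Integer arithmetic avoids floating-point rounding of the cut-off position.
    steps = len(crawl_order) * crawl_percent // 100
    if not targets:
        return 0.0
    found = sum(1 for page in crawl_order[:steps] if page in targets)
    return found / len(targets)


# Hypothetical visit orders over the same ten pages with three target pages.
targets = {"t1", "t2", "t3"}
focused_order = ["t1", "a", "t2", "b", "c", "d", "e", "t3", "f", "g"]
breadth_first_order = ["a", "b", "c", "t1", "d", "t2", "e", "f", "t3", "g"]

focused_early = fraction_found_at(focused_order, targets, 30)        # 2 of 3 targets
breadth_early = fraction_found_at(breadth_first_order, targets, 30)  # 0 of 3 targets
```

Evaluating the same function at several percentages yields one curve per crawler, which is the kind of comparison the figures describe.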

Similar resources

Efficient Deep Web Crawling Using Reinforcement Learning

The deep web refers to the hidden part of the Web that remains unavailable to standard Web crawlers. Obtaining the content of the deep web is challenging and has been acknowledged as a significant gap in the coverage of search engines. To this end, the paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and deep web database as t...

FICA: A novel intelligent crawling algorithm based on reinforcement learning

The web is a huge and highly dynamic environment which is growing exponentially in content and developing fast in structure. No search engine can cover the whole web, thus it has to focus on the most valuable pages for crawling. So an efficient crawling algorithm for retrieving the most important pages remains a challenging issue. Several algorithms like PageRank and OPIC have been proposed. Un...

Learning to Surface Deep Web Content

We propose a novel deep web crawling framework based on reinforcement learning. The crawler is regarded as an agent and deep web database as the environment. The agent perceives its current state and submits a selected action (query) to the environment according to Q-value. Based on the framework we develop an adaptive crawling method. Experimental results show that it outperforms the state of ...

Teaching Reinforcement Learning using a Physical Robot

This paper presents a little crawling robot as a didactic instrument for teaching reinforcement learning. The robot learns a forward-walking policy from scratch in less than 20 seconds of reinforced sensorimotor interactions. The state space consists of two discretized dimensions, where the behavior is visualizable and comprehensible. In laboratory tutorials, students conduct experiments with a ...

YAFC: Yet Another Focused Crawler

As the Web continues to grow rapidly, focused topic-specific Web crawlers will gain popularity over traditional general-purpose search engines for locating, indexing and keeping up to date information on the Web. This paper presents YAFC (Yet Another Focused Crawler), a neurodynamic programming approach to focused crawling. YAFC combines TD(λ) reinforcement learning with a neural network to lea...

Publication date: 2003